[python] Support query auth (row filter & column masking) for REST catalog#8136
[python] Support query auth (row filter & column masking) for REST catalog#8136MgjLLL wants to merge 6 commits into
Conversation
|
I found a few correctness issues in the query-auth paths introduced here:
|
|
Fixes for issues raised by @JingsongLi, plus one additional issue found during analysis. Fix 1: Ray read path bypasses auth (
|
@JingsongLi All 3 issues fixed (+ 1 additional parallel path bypass found during analysis). See updated PR description. PTAL. |
| if not self.filter and not self.column_masking: | ||
| return plan | ||
| auth_splits = [QueryAuthSplit(split, self) for split in plan.splits()] | ||
| return Plan(auth_splits) |
There was a problem hiding this comment.
Java TableScan.Plan does not carry a snapshot id, but Python Plan does and the update / row-id update paths use it as check_from_snapshot. Wrapping the plan here drops plan.snapshot_id, so a query-auth table planned from a non-empty snapshot becomes snapshot_id=None; table_update then emits commit messages with -1, which disables the row-id conflict checks (and related global-index update checks). Please preserve the original plan metadata, e.g. Plan(auth_splits, snapshot_id=plan.snapshot_id).
|
|
||
| return reader | ||
|
|
||
| def _create_split_read_with_read_type(self, split, read_type): |
There was a problem hiding this comment.
This auth-specific construction bypasses the normal PK read path above. In _create_split_read, PK tables inject missing sequence.field columns into the inner read type and then project them back out, matching the Java withReadType + outer projection behavior. Here, if query auth is enabled and the user projects id,val from a PK table with sequence.field=ts, MergeFileSplitRead is built without ts; that can either fail with sequence.field ... not found or merge by file sequence instead of the configured user sequence. Please reuse the existing _create_split_read widening/project-back logic for effective_read_type, or factor it so the auth path cannot drift from the normal PK path.
| elif function == "LIKE": | ||
| raw = literals[0] | ||
| escaped = re.escape(raw) | ||
| pattern = escaped.replace("%", ".*").replace("_", ".") |
There was a problem hiding this comment.
This does not match the JVM LIKE semantics. Java treats backslash as the default escape character before expanding % / _, so a policy predicate like LIKE admin\\_% matches admin_foo and not adminXfoo. Escaping the whole string first and then replacing every % / _ makes escaped wildcards behave as wildcards (or requires a literal backslash), so Python can allow/deny different rows from the Java client for the same auth filter. Please port the Java Like.sqlToRegexLike behavior, including invalid escape handling.
|
@JingsongLi All 3 issues fixed (+ several related correctness issues found during follow-up self-review). New commit Fixes for the three review comments1.
|
…talog
Adds query-auth support to the Python client so it honors the row-level
filter and column masking rules returned by a REST catalog, matching the
existing JVM client behavior.
When the new option `query-auth.enabled` is set to true, the client
calls `POST /v1/.../databases/{db}/tables/{tb}/auth` before producing a
plan, receives `{ filter, columnMasking }`, and applies them on the
read path:
* `predicate_json_parser` parses Paimon predicate JSON into a
PyArrow compute filter (EQ/NEQ/LT/LTEQ/GT/GTEQ/IS_NULL/IS_NOT_NULL/
IN/NOT_IN/STARTS_WITH/ENDS_WITH/CONTAINS/AND/OR/NOT).
* `AuthFilterReader` / `AuthMaskingReader` / `ColumnProjectReader`
perform row filtering, column masking transforms (NULL, FIELD_REF,
CAST, UPPER, LOWER, CONCAT, CONCAT_WS) and final projection back to
the user's requested columns.
* `TableQueryAuth` / `TableQueryAuthResult` wrap the result and
convert each split to a `QueryAuthSplit`.
Behavior is gated by `CoreOptions.QUERY_AUTH_ENABLED` (default false),
so existing users see no change.
- Ray: use table.new_read_builder() instead of direct ReadBuilder() - Streaming: pass query_auth to AsyncStreamingTableScan, apply to all plans - Merge reader: add RecordReaderToBatchAdapter for primary-key tables - Parallel: use _create_reader_for_split, add raw_convertible proxy
Return None instead of a local lambda from table_query_auth() when auth is disabled, since pickle cannot serialize local lambdas. This fixes serializable_test and ray_sink_test failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses the three review comments from @JingsongLi on commit 482fdad, plus related correctness issues found during follow-up self-review. 1. Preserve Plan.snapshot_id in TableQueryAuthResult.convert_plan Java TableScan.Plan does not carry a snapshot id, but Python Plan does, and table_update / table_update_by_row_id rely on it as check_from_snapshot. Wrapping was dropping snapshot_id, so query-auth tables planned from a non-empty snapshot lost row-id conflict / global-index update checks. 2. Reuse the normal PK read path instead of a parallel auth-only one Removed _create_split_read_with_read_type and made _create_split_read accept an optional read_type. The auth path now goes through the same widening/project-back logic that injects sequence.field columns for PK MergeFileSplitRead, so `id,val` projection on a PK table with sequence.field=ts no longer fails or merges by file sequence. 3. Port Java Like.sqlToRegexLike escape semantics `LIKE admin\_%` now matches `admin_foo` and not `adminXfoo`, matching JVM behavior. Backslash is the default escape character; invalid escape sequences raise instead of being silently treated as wildcards. 4. Streaming path now applies query auth to every plan StreamReadBuilder forwards query_auth and read_type to AsyncStreamingTableScan, which calls _apply_auth on initial, follow-up, and catch-up plans. Catch-up reuses _create_initial_plan_raw to avoid double auth. 5. Iterator / parallel / Ray paths route through _create_reader_for_split, so QueryAuthSplit detection is centralized. to_iterator gains a RecordBatchReader branch for PyTorch / generic iterator consumers of auth-wrapped splits. scan_with_stats also applies auth for parity with plan(). 6. Row kind preserved through the auth pipeline on PK tables RecordReaderToBatchAdapter encodes row_kind into a `_row_kind` column when include_row_kind=True; ColumnProjectReader keeps `_row_kind` even when the user projection drops it; to_iterator restores it via OffsetRow.set_row_kind_byte without leaking the column into row_tuple. 7. RecordReaderToBatchAdapter no longer drops rows when an inner read_batch yields more than chunk_size rows: extra rows are carried over to the next flush. 8. QueryAuthSplit attribute delegation __getattr__ forwards file_size, file_paths, data_deletion_files, raw_convertible, etc. to the inner split for Ray/explain/Daft paths, while guarding _-prefixed names to avoid pickle recursion. 9. Daft explain_scan now reports the same reader_mode as the real read path. ExplainSplitInfo carries has_auth, _build_explain_result populates it from QueryAuthSplit, and Daft's _split_has_auth accepts both QueryAuthSplit and the explain descriptor so "query auth active" appears as a fallback reason consistently. 10. table_query_auth returns None (not a local lambda) when auth is disabled, so FileStoreTable remains pickle-safe for Ray. - New unit tests cover snapshot_id preservation, row_kind through the adapter / project reader, chunk-size carry-over, and QueryAuthSplit attribute delegation incl. pickle round-trip. - All query-auth related tests (45) pass under pytest: `pytest pypaimon/tests/ -k "explain or query_auth or table_query_auth"` Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
99efdcc to
bf32ff2
Compare
bf32ff2 to
15ee1c1
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
I found a few correctness issues in the Python query-auth implementation.
| elif function == "IN": | ||
| return pc.is_in(value_array, pa.array(converted, type=value_array.type)) | ||
| elif function == "NOT_IN": | ||
| return pc.invert(pc.is_in(value_array, pa.array(converted, type=value_array.type))) |
There was a problem hiding this comment.
Java Paimon's NotIn returns false when the input field is null, but pc.is_in(null, ...) is false and inverting it makes null rows pass. For a row-filter auth rule such as dept NOT_IN ('blocked'), Python would expose rows where dept is null. Please combine this with pc.is_valid(value_array) and also preserve Java's null-literal behavior, where any null literal makes NOT_IN false.
| self._parsed_rules = {col: json.loads(tj) for col, tj in masking_rules.items()} | ||
| read_field_names = {f.name for f in read_fields} | ||
| for col_name, transform in self._parsed_rules.items(): | ||
| for ref_name in _collect_all_field_refs_from_transform(transform): |
There was a problem hiding this comment.
This validates references for every masking rule before checking whether the masked target column is actually projected. If REST returns a rule like secret = FIELD_REF(email) and the user reads only id, the Python reader raises because email is absent even though secret is not returned. Java skips masking rules whose target column is absent from the output row type before remapping inputs, so this should filter to projected target columns first.
| return pa.array([True] * batch_len, type=pa.bool_()) | ||
| elif function == "FALSE": | ||
| return pa.array([False] * batch_len, type=pa.bool_()) | ||
| raise ValueError(f"Unknown leaf function: {function}") |
There was a problem hiding this comment.
IS_NAN is a valid Java Paimon predicate (PredicateBuilder.isNaN / IsNaN.NAME) and can be serialized in REST auth filters. With the current switch it fails every Python read with Unknown leaf function: IS_NAN; please add a branch using pc.is_nan for float/double arrays.
…r auth predicates
Purpose
Adds query-auth support to the Python client so it honors the row-level filter and column masking rules returned by a REST catalog, matching the existing JVM client behavior.
When the new option
query-auth.enabledis set totrue, before producing aPlanthe client callsPOST /v1/.../databases/{db}/tables/{tb}/authwith the projected fields, receives{ filter, columnMasking }, and applies them on the read path:RESTApi.auth_table_queryissues the call (new request/response modelsAuthTableQueryRequest/AuthTableQueryResponse, new path inResourcePaths.auth_table).TableQueryAuth/TableQueryAuthResult(catalog/table_query_auth.py) wrap the result and convert each split to aQueryAuthSplit.predicate_json_parser(common/predicate_json_parser.py) parses Paimon predicate JSON into a PyArrow compute filter (EQ/NEQ/LT/LTEQ/GT/GTEQ/IS_NULL/IS_NOT_NULL/IN/NOT_IN/STARTS_WITH/ENDS_WITH/CONTAINS/AND/OR/NOT).AuthFilterReader/AuthMaskingReader/ColumnProjectReader(read/reader/auth_masking_reader.py) implement row filtering, column masking transforms (NULL,FIELD_REF,CAST,UPPER,LOWER,CONCAT,CONCAT_WS) and final projection back to the user's requested columns.read_builder/stream_read_builder/table_read/table_scan/file_store_table/catalog_environment/rest_catalogare wired to invoke the auth call and pull extra fields required only by the auth filter.Behavior is gated by the new
CoreOptions.QUERY_AUTH_ENABLED(query-auth.enabled, defaultfalse), so existing users see no change.Tests
Three new test files (994+ lines, all passing locally under
pytest):paimon-python/pypaimon/tests/predicate_json_parser_test.py— covers each predicate kind, nested AND/OR/NOT, type coercion, null handling, andextract_referenced_fields.paimon-python/pypaimon/tests/auth_masking_reader_test.py— covers each masking transform, missing-field validation, and projection back to the user-requested columns.paimon-python/pypaimon/tests/table_query_auth_test.py— end-to-end coverage: REST catalog callsauth_table_query, the result is plumbed into the plan, splits becomeQueryAuthSplit, and reads return filtered + masked rows.Local check:
API and Format
query-auth.enabled(boolean, defaultfalse).POST /v1/{prefix}/databases/{db}/tables/{tb}/auth. Request{ "select": [...] }, response{ "filter": [<predicate-json>...], "columnMasking": { <col>: <transform-json>, ... } }. The contract follows the existing Java client; no server-side change is required for catalogs that already implement query auth.AuthTableQueryRequest,AuthTableQueryResponse,TableQueryAuth,TableQueryAuthResult,QueryAuthSplit,AuthFilterReader,AuthMaskingReader,ColumnProjectReader) are additive and live under existing modules.Documentation
The new option
query-auth.enabledshould be reflected in the Python configuration reference. Happy to add the docs entry in this PR or in a follow-up — please advise.This closes #8135